Abstract

I postulate that human or other intelligent agents function or should function as
follows. They store all sensory observations as they come — the data is 'holy.' At 
any time, given some agent's current coding capabilities, part of the data is com- 
pressible by a short and hopefully fast program / description / explanation / world 
model. In the agent's subjective eyes, such data is more regular and more beautiful 
than other data. It is well-known that knowledge of regularity and repeatability 
may improve the agent's ability to plan actions leading to external rewards. In 
absence of such rewards, however, known beauty is boring. Then interestingness 
becomes the first derivative of subjective beauty: as the learning agent improves 
its compression algorithm, formerly apparently random data parts become sub- 
jectively more regular and beautiful. Such progress in data compression is mea- 
sured and maximized by the curiosity drive: create action sequences that extend 
the observation history and yield previously unknown / unpredictable but quickly 
learnable algorithmic regularity. I discuss how all of the above can be naturally
implemented on computers, through an extension of passive unsupervised learning to
the case of active data selection: we reward a general reinforcement learner (with 
access to the adaptive compressor) for actions that improve the subjective com- 
pressibility of the growing data. An unusually large compression breakthrough 
deserves the name discovery. The creativity of artists, dancers, musicians, pure 
mathematicians can be viewed as a by-product of this principle. Several qualita- 
tive examples support this hypothesis. 

* Joint Invited Lecture for Algorithmic Learning Theory (ALT 2007) and Discovery Science (DS 2007),
Sendai, Japan (preprint). Variant to appear in: V. Corruble, M. Takeda, and E. Suzuki (Eds.): DS 2007,
LNAI 4755, pp. 26-38, Springer-Verlag Berlin Heidelberg 2007. Abstract: M. Hutter, R.A. Servedio, and
E. Takimoto (Eds.): ALT 2007, LNAI 4754, pp. 24-25, Springer-Verlag Berlin Heidelberg 2007; see also
http://www.springerlink.com/content/y8j345310757m637/?p=42fb108af50a4cbf8ec06c12309884f6.&pi=2
and http://www.springer.com/west/home/generic/search/results?SGWID=4-40109-22-173760307-0
1 Introduction



A human lifetime lasts about 3 x 10^9 seconds. The human brain has roughly 10^10
neurons, each with 10^4 synapses on average. Assuming each synapse can store not
more than 3 bits, there is still enough capacity to store the lifelong sensory input stream
with a rate of roughly 10^5 bits/s, comparable to the demands of a movie with reasonable
resolution. The storage capacity of affordable technical systems will soon exceed this 
value. 
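
As a quick sanity check of this estimate, the arithmetic can be reproduced in a few lines of Python. The figures used (3 x 10^9 seconds, 10^10 neurons, 10^4 synapses per neuron, at most 3 bits per synapse) are the round numbers assumed here; the variable names are illustrative only.

```python
# Back-of-the-envelope check of the lifelong storage estimate.
LIFETIME_SECONDS = 3e9        # roughly 95 years
NEURONS = 1e10                # assumed order of magnitude
SYNAPSES_PER_NEURON = 1e4     # assumed average
BITS_PER_SYNAPSE = 3          # upper bound assumed in the text

# Total capacity if every synapse stores its maximum number of bits.
capacity_bits = NEURONS * SYNAPSES_PER_NEURON * BITS_PER_SYNAPSE

# Input rate that this capacity could absorb over a whole lifetime.
sustainable_rate = capacity_bits / LIFETIME_SECONDS

print(f"capacity: {capacity_bits:.1e} bits")
print(f"sustainable input rate: {sustainable_rate:.1e} bits/s")
```

Under these assumptions the capacity is 3 x 10^14 bits, which supports an input rate of about 10^5 bits/s for the entire lifetime.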

Hence, it is not unrealistic to consider a mortal agent that interacts with an envi- 
ronment and has the means to store the entire history of sensory inputs, which partly 
depends on its actions. This data anchors all it will ever know about itself and its role 
in the world. In this sense, the data is 'holy.' 

What should the agent do with the data? How should it learn from it? Which 
actions should it execute to influence future data? 

Some of the sensory inputs reflect external rewards. At any given time, the agent's 
goal is to maximize the remaining reward or reinforcement to be received before it 
dies. In realistic settings external rewards are rare though. In absence of such rewards 
through teachers etc., what should be the agent's motivation? Answer: It should spend 
some time on unsupervised learning, figuring out how the world works, hoping this 
knowledge will later be useful to gain external rewards. 

Traditional unsupervised learning is about finding regularities, by clustering the 
data, or encoding it through a factorial code [14] with statistically independent
components, or predicting parts of it from other parts. All of this may be viewed as special
cases of data compression. For example, where there are clusters, a data point can be 
efficiently encoded by its cluster center plus relatively few bits for the deviation from 
the center. Where there is data redundancy, a non-redundant factorial code [14] will
be more compact than the raw data. Where there is predictability, compression can be 
achieved by assigning short codes to events that are predictable with high probability 
Q. Generally speaking we may say that a major goal of traditional unsupervised learn- 
ing is to improve the compression of the observed data, by discovering a program that 
computes and thus explains the history (and hopefully does so quickly) but is clearly 
shorter than the shortest previously known program of this kind. 
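
To make the link between predictability and compression concrete, here is a minimal sketch that computes the ideal (Shannon) code length of a sequence when each symbol costs -log2 of its probability. It assumes a simple empirical frequency model and ignores the cost of describing the model itself; the function name is hypothetical.

```python
import math
from collections import Counter

def ideal_code_length_bits(symbols):
    """Total code length when each symbol s costs -log2 p(s) bits,
    with p estimated from the symbol frequencies of the sequence."""
    counts = Counter(symbols)
    n = len(symbols)
    return sum(-math.log2(counts[s] / n) for s in symbols)

predictable = "aaaaaaab" * 8   # one symbol dominates: short codes suffice
uniform = "abcdefgh" * 8       # eight equally likely symbols: 3 bits each

print(ideal_code_length_bits(predictable))  # far fewer bits than the uniform case
print(ideal_code_length_bits(uniform))
```

Both sequences have 64 symbols, but the predictable one needs far fewer bits because its frequent symbol receives a very short code.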

According to our complexity-based theory of beauty [15, 17, 25], the agent's
currently achieved compression performance corresponds to subjectively perceived beauty:
among several sub-patterns classified as 'comparable' by a given observer, the subjec- 
tively most beautiful is the one with the simplest (shortest) description, given the ob- 
server's particular method for encoding and memorizing it. For example, mathemati- 
cians find beauty in a simple proof with a short description in the formal language they 
are using. Others like geometrically simple, aesthetically pleasing, low-complexity 
drawings of various objects [15, 17].

Traditional unsupervised learning is not enough though — it just analyzes and en- 
codes the data but does not choose it. We have to extend it along the dimension of 
active action selection, since our unsupervised learner must also choose the actions 
that influence the observed data, just like a scientist chooses his experiments, a baby its 
toys, an artist his colors, a dancer his moves, or any attentive system its next sensory 
input. 
Which data should the agent select by executing appropriate actions? Which are
the interesting sensory inputs that deserve to be targets of its curiosity? I postulate [25]
that in the absence of external rewards or punishment the answer is: Those that yield 
progress in data compression. What does this mean? New data observed by the learn- 
ing agent may initially look rather random and incompressible and hard to explain. A 
good learner, however, will improve its compression algorithm over time, using some 
application-dependent learning algorithm, making parts of the data history subjectively 
more compressible, more explainable, more regular and more 'beautiful.' A beautiful 
thing is interesting only as long as it is new, that is, as long as the algorithmic regularity 
that makes it simple has not yet been fully assimilated by the adaptive observer who is 
still learning to compress the data better. So the agent's goal should be: create action
sequences that extend the observation history and yield previously unknown / unpre- 
dictable but quickly learnable algorithmic regularity or compressibility. To rephrase 
this principle in an informal way: maximize the first derivative of subjective beauty. 

An unusually large compression breakthrough deserves the name discovery. How 
can we motivate a reinforcement learning agent to make discoveries? Clearly, we can- 
not simply reward it for executing actions that just yield a compressible but boring 
history. For example, a vision-based agent that always stays in the dark will experi- 
ence an extremely compressible and uninteresting history of unchanging sensory in- 
puts. Neither can we reward it for executing actions that yield highly informative but 
uncompressible data. For example, our agent sitting in front of a screen full of white 
noise will experience highly unpredictable and fundamentally uncompressible and un- 
interesting data conveying a lot of information in the traditional sense of Boltzmann 
and Shannon [32]. Instead, the agent should receive reward for creating / observing
data that allows for improvements of the data's subjective compressibility. 

The appendix will describe formal details of how to implement this principle on 
computers. The next section will provide examples of subjective beauty tailored to 
human observers, and illustrate the learning process leading from less to more subjec- 
tive beauty. Then I will argue that the creativity of artists, dancers, musicians, pure 
mathematicians as well as unsupervised attention in general is just a by-product of our 
principle, using qualitative examples to support this hypothesis. 

2 Visual Examples of Subjective Beauty and its 'First 
Derivative' Interestingness 

Figure 1 depicts the drawing of a female face considered 'beautiful' by some human
observers. It also shows that the essential features of this face follow a very simple
geometrical pattern [17] to be specified by very few bits of information. That is, the data
stream generated by observing the image (say, through a sequence of eye saccades) 
is more compressible than it would be in the absence of such regularities. Although 
few people are able to immediately see how the drawing was made without studying 
its grid-based explanation (right-hand side of Figure 1), most do notice that the facial
features somehow fit together and exhibit some sort of regularity. According to our 
postulate, the observer's reward is generated by the conscious or subconscious
discovery of this compressibility. The face remains interesting until its observation no longer
reveals any additional previously unknown regularities. Then it becomes boring even in
the eyes of those who think it is beautiful: beauty and interestingness are two different
things. 




Figure 1 : Left: Drawing of a female face based on a previously published construc- 
tion plan [17] (1998). Some human observers report they feel this face is 'beautiful.' 
Although the drawing has lots of noisy details (texture etc) without an obvious short 
description, positions and shapes of the basic facial features are compactly encodable 
through a very simple geometrical scheme. Hence the image contains a highly com- 
pressible algorithmic regularity or pattern describable by few bits of information. An 
observer can perceive it through a sequence of attentive eye movements or saccades, 
and consciously or subconsciously discover the compressibility of the incoming data 
stream. Right: Explanation of how the essential facial features were constructed [17].
First the sides of a square were partitioned into 2^4 equal intervals. Certain interval
boundaries were connected to obtain three rotated, superimposed grids based on lines
with slopes ±1 or ±1/2^3 or ±2^3/1. Higher-resolution details of the grids were
obtained by iteratively selecting two previously generated, neighbouring, parallel lines
and inserting a new one equidistant to both. Finally the grids were vertically
compressed by a factor of 1 − 2^−4. The resulting lines and their intersections define
essential boundaries and shapes of eyebrows, eyes, lid shades, mouth, nose, and facial frame
in a simple way that is obvious from the construction plan. Although this plan is simple 
in hindsight, it was hard to find: hundreds of my previous attempts at discovering such 
precise matches between simple geometries and pretty faces failed. 

Figure 2 provides another example: a butterfly and a vase with a flower. The image
to the left can be specified by very few bits of information; it can be constructed through
a very simple procedure or algorithm based on fractal circle patterns [15]. People who
understand this algorithm tend to appreciate the drawing more than those who do not. 
They realize how simple it is. This is not an immediate, all-or-nothing, binary process 
though. Since the typical human visual system has a lot of experience with circles, most
people quickly notice that the curves somehow fit together in a regular way. But few
are able to immediately state the precise geometric principles underlying the drawing.
This pattern, however, is learnable from the right-hand side of Figure 2. The conscious
or subconscious discovery process leading from a longer to a shorter description of the 
data, or from less to more compression, or from less to more subjectively perceived 
beauty, yields reward depending on the first derivative of subjective beauty. 




Figure 2: Left: Image of a butterfly and a vase with a flower, reprinted from Leonardo
[15, 25]. Right: Explanation of how the image was constructed through a very simple
algorithm exploiting fractal circles [15]. The frame is a circle; its leftmost point is the
center of another circle of the same size. Wherever two circles of equal size touch or
intersect, there are centers of two more circles, of equal and half size, respectively. Each
line of the drawing is a segment of some circle; its endpoints are where circles touch
or intersect. There are few big circles and many small ones. In general, the smaller a
circle, the more bits are needed to specify it. The drawing to the left is simple
(compressible) as it is based on few, rather large circles. Many human observers report
that they derive a certain amount of pleasure from discovering this simplicity. The ob- 
server's learning process causes a reduction of the subjective complexity of the data, 
yielding a temporarily high derivative of subjective beauty. (Again I needed a long time 
to discover a satisfactory way of using fractal circles to create a reasonable drawing.) 
3 Compressibility-Based Rewards of Art and Music



The examples above indicate that works of art and music may have important
purposes beyond their social aspects [1], despite those who classify art as superfluous
[10]. Good observer-dependent art deepens the observer's insights about this world or
possible worlds, unveiling previously unknown regularities in compressible data, con- 
necting previously disconnected patterns in an initially surprising way that makes the 
combination of these patterns subjectively more compressible, and eventually becomes 
known and less interesting. I postulate that the active creation and attentive perception 
of all kinds of artwork are just by-products of my curiosity principle yielding reward 
for compressor improvements. 

Let us elaborate on this idea in more detail, following the discussion in [25].
Artificial or human observers must perceive art sequentially, and typically also actively, e.g.,
through a sequence of attention-shifting eye saccades or camera movements scanning 
a sculpture, or internal shifts of attention that filter and emphasize sounds made by a 
pianist, while suppressing background noise. Undoubtedly many derive pleasure and
rewards from perceiving works of art, such as certain paintings, or songs. But differ- 
ent subjective observers with different sensory apparati and compressor improvement 
algorithms will prefer different input sequences. Hence any objective theory of what is 
good art must take the subjective observer as a parameter, to answer questions such as: 
Which action sequences should he select to maximize his pleasure? According to our 
principle he should select one that maximizes the quickly learnable compressibility that 
is new, relative to his current knowledge and his (usually limited) way of incorporating
or learning new data. 

For example, which song should some human observer select next? Not the one he 
just heard ten times in a row. It became too predictable in the process. But also not 
the new weird one with the completely unfamiliar rhythm and tonality. It seems too 
irregular and contains too much arbitrariness and subjective noise. He should try a song
that is unfamiliar enough to contain somewhat unexpected harmonies or melodies or 
beats etc., but familiar enough to allow for quickly recognizing the presence of a new 
learnable regularity or compressibility in the sound stream. Sure, this song will get 
boring over time, but not yet. 

The observer dependence is illustrated by the fact that Schönberg's twelve tone
music is less popular than certain pop music tunes, presumably because its algorithmic 
structure is less obvious to many human observers as it is based on more complicated 
harmonies. For example, frequency ratios of successive notes in twelve tone music 
often cannot be expressed as fractions of very small integers. Those with a prior ed- 
ucation about the basic concepts and objectives and constraints of twelve tone music, 
however, tend to appreciate Schönberg more than those without such an education.

All of this perfectly fits our principle: The current compressor of a given subjective 
observer tries to compress his history of acoustic and other inputs where possible. The 
action selector tries to find history-influencing actions that improve the compressor's 
performance on the history so far. The interesting musical and other subsequences
are those with previously unknown yet learnable types of regularities, because they 
lead to compressor improvements. The boring patterns are those that seem arbitrary or 
random, or whose structure seems too hard to understand. 
Similar statements not only hold for other dynamic art including film and dance
(taking into account the compressibility of controller actions), but also for painting and
sculpture, which cause dynamic pattern sequences due to attention-shifting actions [31]
of the observer. 

Just as observers get intrinsic rewards from sequentially focusing attention on art- 
work that exhibits new, previously unknown regularities, the 'creative' artists get re- 
ward for making it. For example, I found it extremely rewarding to discover (after 
hundreds of frustrating failed attempts) the simple geometric regularities that
permitted the construction of the drawings in Figures 1 and 2. The distinction between artists
and observers is not clear though. Artists can be observers and vice versa. Both artists 
and observers execute action sequences. The intrinsic motivations of both are fully 
compatible with our simple principle. Some artists, however, crave external reward 
from other observers, in form of praise, money, or both, in addition to the internal re- 
ward that comes from creating a new work of art. Our principle, however, conceptually 
separates these two types of reward. 

From our perspective, scientists are very much like artists. They actively select 
experiments in search for simple laws compressing the observation history. For exam- 
ple, different apples tend to fall off their trees in similar ways. The discovery of a law 
underlying the acceleration of all falling apples helps to greatly compress the recorded 
data. 

The framework in the appendix is sufficiently formal to allow for implementation 
of our principle on computers. The resulting artificial observers will vary in terms of 
the computational power of their history compressors and learning algorithms. This 
will influence what is good art / science to them, and what they find interesting. 

A Appendix 

This appendix is a compactified, compressibility-oriented variant of parts of [25].

The world can be explained to a degree by compressing it. The compressed ver- 
sion of the data can be viewed as its explanation. Discoveries correspond to large data 
compression improvements (found by the given, application-dependent compressor im- 
provement algorithm). How to build an adaptive agent that not only tries to achieve 
externally given rewards but also to discover, in an unsupervised and experiment-based 
fashion, explainable and compressible data? (The explanations gained through explo- 
rative behavior may eventually help to solve teacher-given tasks.) 

Let us formally consider a learning agent whose single life consists of discrete
cycles or time steps $t = 1, 2, \ldots, T$. Its complete lifetime $T$ may or may not be
known in advance. In what follows, the value of any time-varying variable $Q$ at time $t$
($1 \le t \le T$) will be denoted by $Q(t)$, the ordered sequence of values $Q(1), \ldots, Q(t)$
by $Q(\le t)$, and the (possibly empty) sequence $Q(1), \ldots, Q(t-1)$ by $Q(< t)$. At any
given $t$ the agent receives a real-valued input $x(t)$ from the environment and executes
a real-valued action $y(t)$ which may affect future inputs. At times $t < T$ its goal is to
maximize future success or utility

$$u(t) = E_\mu \left[ \sum_{\tau=t+1}^{T} r(\tau) \;\middle|\; h(\le t) \right], \qquad (1)$$

where $r(t)$ is an additional real-valued reward input at time $t$, $h(t)$ the ordered triple
$[x(t), y(t), r(t)]$ (hence $h(\le t)$ is the known history up to $t$), and $E_\mu(\cdot \mid \cdot)$ denotes
the conditional expectation operator with respect to some possibly unknown
distribution $\mu$ from a set $\mathcal{M}$ of possible distributions. Here $\mathcal{M}$ reflects whatever is known
about the possibly probabilistic reactions of the environment. For example, $\mathcal{M}$ may
contain all computable distributions [33, 34, 9, 4]. There is just one life, no need for
predefined repeatable trials, no restriction to Markovian interfaces between sensors and
environment, and the utility function implicitly takes into account the expected
remaining lifespan $E_\mu(T \mid h(\le t))$ and thus the possibility to extend it through appropriate
actions [23, 26, 24].

Recent work has led to the first learning machines that are universal and optimal 
in various very general senses [4, 23, 26, 22, 28, 29]. Such machines can in principle
find out by themselves whether curiosity and world model construction are useful or 
useless in a given environment, and learn to behave accordingly. The present appendix, 
however, will assume a priori that compression / explanation of the history is good 
and should be done; here we shall not worry about the possibility that 'curiosity may 
kill the cat.' Towards this end, in the spirit of our previous work [12, 11, 35, 16, 18],
we split the reward signal $r(t)$ into two scalar real-valued components:
$r(t) = g(r_{ext}(t), r_{int}(t))$, where $g$ maps pairs of real values to real values, e.g.,
$g(a, b) = a + b$. Here $r_{ext}(t)$ denotes traditional external reward provided by the environment,
such as negative reward in response to bumping against a wall, or positive reward in 
response to reaching some teacher-given goal state. But I am especially interested 
in $r_{int}(t)$, the internal or intrinsic or curiosity reward, which is provided whenever the
data compressor / internal world model of the agent improves in some sense. Our initial 
focus will be on the case $r_{ext}(t) = 0$ for all valid $t$. The basic principle is essentially
the one we published before in various variants [11, 12, 35, 16, 18, 22, 25]:

Principle 1 Generate curiosity reward for the controller in response to improvements 
of the history compressor.

So we conceptually separate the goal (explaining / compressing the history) from the 
means of achieving the goal. Once the goal is formally specified in terms of an algo- 
rithm for computing curiosity rewards, let the controller's reinforcement learning (RL) 
mechanism figure out how to translate such rewards into action sequences that allow 
the given compressor improvement algorithm to find and exploit previously unknown 
types of compressibility. 

A.1 Predictors vs Compressors

Most of our previous work on artificial curiosity was prediction-oriented, e.g., [11, 12].
Prediction and compression are closely related though. A predictor
that correctly predicts many $x(\tau)$, given history $h(< \tau)$, for $1 \le \tau \le t$, can be used to
encode $h(\le t)$ compactly: Given the predictor, only the wrongly predicted $x(\tau)$ plus
information about the corresponding time steps $\tau$ are necessary to reconstruct history
$h(\le t)$, e.g., [13]. Similarly, a predictor that learns a probability distribution of the
possible next events, given previous events, can be used to efficiently encode
observations with high (respectively low) predicted probability by few (respectively many)
bits [3, 30], thus achieving a compressed history representation. Generally speaking,
we may view the predictor as the essential part of a program $p$ that re-computes $h(\le t)$.
If this program is short in comparison to the raw data $h(\le t)$, then $h(\le t)$ is regular or
non-random [33, 7, 9, 19], presumably reflecting essential environmental laws. Then $p$
may also be highly useful for predicting future, yet unseen $x(\tau)$ for $\tau > t$.
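
The first idea, storing only the wrongly predicted inputs together with their time steps, can be sketched as follows. The predictor used here (guess that the last symbol repeats) is a deliberately trivial assumption, and all function names are hypothetical; any deterministic predictor shared by encoder and decoder would do.

```python
def predict_next(history):
    """Toy predictor (an assumption): the last symbol repeats; 0 if no history."""
    return history[-1] if history else 0

def encode(xs):
    """Keep only (time step, value) pairs where the predictor was wrong."""
    corrections = []
    for t, x in enumerate(xs):
        if predict_next(xs[:t]) != x:
            corrections.append((t, x))
    return len(xs), corrections

def decode(length, corrections):
    """Rebuild the history by running the same predictor and applying fixes."""
    fixes = dict(corrections)
    xs = []
    for t in range(length):
        xs.append(fixes[t] if t in fixes else predict_next(xs))
    return xs

history = [0] * 10 + [1] * 10 + [0] * 5
length, corrections = encode(history)
print(corrections)                      # only the two change points are stored
print(decode(length, corrections) == history)
```

The better the predictor, the fewer corrections have to be stored, which is exactly the sense in which a good predictor is a good compressor.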

A.2 Compressor Performance Measures 

At any time $t$ ($1 \le t < T$), given some compressor program $p$ able to compress
history $h(\le t)$, let $C(p, h(\le t))$ denote $p$'s compression performance on $h(\le t)$. An
appropriate performance measure would be

$$C_l(p, h(\le t)) = l(p), \qquad (2)$$

where $l(p)$ denotes the length of $p$, measured in number of bits: the shorter $p$, the
more algorithmic regularity and compressibility and predictability and lawfulness in
the observations so far. The ultimate limit for $C_l(p, h(\le t))$ would be $K^*(h(\le t))$,
a variant of the Kolmogorov complexity of $h(\le t)$, namely, the length of the shortest
program (for the given hardware) that computes an output starting with $h(\le t)$
[33, 7, 9, 19].

$C_l(p, h(\le t))$ does not take into account the time $\tau(p, h(\le t))$ spent by $p$ on
computing $h(\le t)$. An alternative performance measure inspired by concepts of optimal
universal search [8, 21] is

$$C_{l\tau}(p, h(\le t)) = l(p) + \log \tau(p, h(\le t)). \qquad (3)$$

Here compression by one bit is worth as much as runtime reduction by a factor of 1/2.
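
The trade-off between program length and runtime can be illustrated with a small sketch. It assumes the logarithm is taken to base 2 (which is what makes one bit equivalent to halving the runtime) and measures runtime in abstract steps; the function names and the example numbers are hypothetical.

```python
import math

def c_l(program_bits):
    """Length-only measure: the program length in bits."""
    return float(program_bits)

def c_l_tau(program_bits, runtime_steps):
    """Length plus log2 of runtime (base 2 is an assumption of this sketch)."""
    return program_bits + math.log2(runtime_steps)

# A program that is one bit shorter but takes twice as long scores the same.
short_but_slow = c_l_tau(99, 2048)
long_but_fast = c_l_tau(100, 1024)
print(short_but_slow, long_but_fast)
```

Under the length-only measure the 99-bit program would always win; the time-aware measure treats the two programs as equally good.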

A.3 Compressor Improvement Measures 

The previous Section A.2 only discussed measures of compressor performance, but
not of performance improvement, which is the essential issue in our curiosity-oriented
context. To repeat the point made above: What matters are the improvements
of the compressor, not its compression performance per se. Our curiosity reward in
response to the compressor's progress (due to some application-dependent compressor
improvement algorithm) between times $t$ and $t + 1$ should be

$$r_{int}(t + 1) = f[C(p(t), h(\le t + 1)), C(p(t + 1), h(\le t + 1))], \qquad (4)$$

where $f$ maps pairs of real values to real values. Various alternative progress measures
are possible; most obvious is $f(a, b) = a - b$.

Note that both the old and the new compressor have to be tested on the same data,
namely, the complete history so far.
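
A toy illustration of this reward, under strong simplifying assumptions: inputs are binary, the "compressor" is a two-entry table predicting the successor of each symbol, and its cost is a fixed 2-bit model plus one bit per misprediction (a crude proxy, not a real code length). All names are hypothetical. The sketch compares three input streams: an unchanging one, random noise, and a learnable regularity.

```python
import random

def cost_bits(table, history):
    """Toy stand-in for C(p, h): fixed 2-bit model plus one bit per
    wrongly predicted symbol. table[prev] is the predicted successor."""
    errors = sum(1 for prev, nxt in zip(history, history[1:])
                 if table[prev] != nxt)
    return 2 + errors

def improve(history):
    """Toy improvement algorithm: predict each symbol's majority successor."""
    table = {}
    for prev in (0, 1):
        successors = [n for p, n in zip(history, history[1:]) if p == prev]
        table[prev] = int(2 * sum(successors) > len(successors))
    return table

def curiosity_reward(p_old, p_new, history):
    """f(a, b) = a - b, with old and new compressor tested on the same data."""
    return cost_bits(p_old, history) - cost_bits(p_new, history)

p_old = {0: 0, 1: 0}                  # initial compressor: always predict 0
rng = random.Random(0)                # fixed seed for reproducibility
streams = {
    "dark room": [0] * 200,
    "white noise": [rng.randint(0, 1) for _ in range(200)],
    "alternating": [0, 1] * 100,
}
rewards = {name: curiosity_reward(p_old, improve(h), h)
           for name, h in streams.items()}
print(rewards)
```

The unchanging stream yields no reward because nothing is left to learn, the noise yields little because almost nothing can be learned, and only the learnable pattern yields a large compression gain.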
A.4 Asynchronous Framework for Creating Curiosity Reward

Let $p(t)$ denote the agent's current compressor program at time $t$, $s(t)$ its current
controller, and do:

Controller: At any time $t$ ($1 \le t < T$) do:

1. Let $s(t)$ use (parts of) history $h(\le t)$ to select and execute $y(t + 1)$.

2. Observe $x(t + 1)$.

3. Check if there is non-zero curiosity reward $r_{int}(t + 1)$ provided by the separate,
asynchronously running compressor improvement algorithm (see below). If not,
set $r_{int}(t + 1) = 0$.

4. Let the controller's reinforcement learning (RL) algorithm use $h(\le t + 1)$
including $r_{int}(t + 1)$ (and possibly also the latest available compressed version of
the observed data; see below) to obtain a new controller $s(t + 1)$, in line with
objective (1).

Compressor: Set $p_{new}$ equal to the initial data compressor. Starting at time 1, repeat
forever until interrupted by death $T$:

1. Set $p_{old} = p_{new}$; get current time step $t$ and set $h_{old} = h(\le t)$.

2. Evaluate $p_{old}$ on $h_{old}$, to obtain $C(p_{old}, h_{old})$ (Section A.2). This may take many
time steps.

3. Let some (application-dependent) compressor improvement algorithm (such as
a learning algorithm for an adaptive neural network predictor) use $h_{old}$ to
obtain a hopefully better compressor $p_{new}$ (such as a neural net with the same size
but improved prediction capability and therefore improved compression
performance). Although this may take many time steps, $p_{new}$ may not be optimal, due
to limitations of the learning algorithm, e.g., local maxima.

4. Evaluate $p_{new}$ on $h_{old}$, to obtain $C(p_{new}, h_{old})$. This may take many time steps.

5. Get current time step $\tau$ and generate curiosity reward

$$r_{int}(\tau) = f[C(p_{old}, h_{old}), C(p_{new}, h_{old})], \qquad (5)$$

e.g., $f(a, b) = a - b$; see Section A.3.
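
The compressor steps 1 to 5 can be sketched in a simplified, synchronous form. This is only an illustration under stated assumptions: the loop runs at fixed checkpoints instead of asynchronously, the controller is omitted, the compressor is a table of most frequent successors, and its cost is simply the number of mispredicted symbols. All names are hypothetical.

```python
from collections import Counter, defaultdict

def cost(table, history):
    """Toy C(p, h): number of symbols the successor table gets wrong."""
    return sum(1 for prev, nxt in zip(history, history[1:])
               if table.get(prev) != nxt)

def improve(history):
    """Toy improvement algorithm: most frequent successor of each symbol."""
    successors = defaultdict(Counter)
    for prev, nxt in zip(history, history[1:]):
        successors[prev][nxt] += 1
    return {prev: c.most_common(1)[0][0] for prev, c in successors.items()}

def compressor_loop(stream, checkpoints):
    """Run steps 1-5 once per checkpoint; return (time, reward) pairs."""
    p_new = {}                               # initial compressor knows nothing
    rewards = []
    for t in checkpoints:
        p_old = p_new                        # step 1: freeze old compressor
        h_old = stream[:t]                   #         and snapshot the history
        c_old = cost(p_old, h_old)           # step 2: evaluate old compressor
        p_new = improve(h_old)               # step 3: try to improve it
        c_new = cost(p_new, h_old)           # step 4: evaluate on the same data
        rewards.append((t, c_old - c_new))   # step 5: f(a, b) = a - b
    return rewards

stream = [0, 1, 2] * 30
print(compressor_loop(stream, [30, 60, 90]))
```

In this toy run only the first checkpoint produces reward: once the periodic pattern has been learned, further observations of it are perfectly compressible and therefore no longer interesting.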

Obviously this asynchronous scheme may cause long temporal delays between
controller actions and corresponding curiosity rewards. This may impose a heavy burden
on the controller's RL algorithm whose task is to assign credit to past actions (to in- 
form the controller about beginnings of compressor evaluation processes etc., we may 
augment its input by unique representations of such events). Nevertheless, there are 
RL algorithms for this purpose which are theoretically optimal in various senses, to be 
discussed next. 
A.5 Optimal Curiosity & Creativity & Focus of Attention



Our chosen compressor class typically will have certain computational limitations. In 
the absence of any external rewards, we may define optimal pure curiosity behavior 
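relative to these limitations: At time $t$ this behavior would select the action that
maximizes

$$u(t) = E_\mu \left[ \sum_{\tau=t+1}^{T} r_{int}(\tau) \;\middle|\; h(\le t) \right]. \qquad (6)$$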



Since the true, world-governing probability distribution is unknown, the resulting 
task of the controller's RL algorithm may be a formidable one. As the system is revis- 
iting previously uncompressible parts of the environment, some of those will tend to 
become more compressible, that is, the corresponding curiosity rewards will decrease 
over time. A good RL algorithm must somehow detect and then predict this decrease, 
and act accordingly. Traditional RL algorithms [6], however, do not provide any
theoretical guarantee of optimality for such situations. (This is not to say though that
sub-optimal RL methods may not lead to success in certain applications; experimental 
studies might lead to interesting insights.) 

Let us first make the natural assumption that the compressor is not super-complex 
such as Kolmogorov's, that is, its output and $r_{int}(t)$ are computable for all $t$. Is there
a best possible RL algorithm that comes as close as any other to maximizing objective
(1)? Indeed, there is. Its drawback, however, is that it is not computable in finite time.
Nevertheless, it serves as a reference point for defining what is achievable at best. 

A.6 Optimal But Incomputable Action Selector 

There is an optimal way of selecting actions which makes use of Solomonoff's
theoretically optimal universal predictors and their Bayesian learning algorithms [33, 34,
9, 4, 5]. The latter only assume that the reactions of the environment are sampled
from an unknown probability distribution $\mu$ contained in a set $\mathcal{M}$ of all enumerable
distributions; compare text after equation (1). More precisely, given an observation
sequence $q(\le t)$, we only assume there exists a computer program that can compute
the probability of the next possible $q(t + 1)$, given $q(\le t)$. In general we do not know
this program, hence we predict using a mixture distribution

$$\xi(q(t + 1) \mid q(\le t)) = \sum_i w_i \, \mu_i(q(t + 1) \mid q(\le t)), \qquad (7)$$

a weighted sum of all distributions $\mu_i \in \mathcal{M}$, $i = 1, 2, \ldots$, where the sum of the
constant weights satisfies $\sum_i w_i \le 1$. This is indeed the best one can possibly do, in a
very general sense [34, 4]. The drawback of the scheme is its incomputability, since
$\mathcal{M}$ contains infinitely many distributions. We may increase the theoretical power of
the scheme by augmenting $\mathcal{M}$ by certain non-enumerable but limit-computable
distributions [19], or restrict it such that it becomes computable, e.g., by assuming the world
is computed by some unknown but deterministic computer program sampled from the
Speed Prior [20] which assigns low probability to environments that are hard to
compute by any method.
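
The mixture idea can be illustrated on a drastically restricted, computable model class. The sketch below assumes only three candidate distributions, each a biased coin with a fixed probability of emitting 1, and equal prior weights; this finite class and all names are assumptions for illustration and have none of the universality of the scheme discussed in the text.

```python
def mixture_probability_of_one(weights, thetas):
    """Mixture prediction: weighted sum of each model's probability of a 1."""
    return sum(w * th for w, th in zip(weights, thetas))

def bayes_update(weights, thetas, bit):
    """Reweight each model by how well it predicted the observed bit."""
    unnormalized = [w * (th if bit == 1 else 1.0 - th)
                    for w, th in zip(weights, thetas)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]

thetas = [0.1, 0.5, 0.9]              # hypothetical finite model class
weights = [1 / 3, 1 / 3, 1 / 3]       # prior weights, summing to 1

before = mixture_probability_of_one(weights, thetas)
for bit in [1] * 20:                  # observe twenty 1s in a row
    weights = bayes_update(weights, thetas, bit)
after = mixture_probability_of_one(weights, thetas)

print(before, after)
```

After the observations, nearly all weight sits on the model that explains the data best, so the mixture's prediction approaches that model's prediction without the true model ever having been identified in advance.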




Once we have such an optimal predictor, we can extend it by formally including
the effects of executed actions to define an optimal action selector maximizing future 
expected reward. At any time t, Hutter's theoretically optimal (yet uncomputable) RL 
algorithm Aixi [|4J uses an extended version of Solomonoff's prediction scheme to 
select those action sequences that promise maximal future reward up to some horizon 
T, given the current data h{< t). That is, in cycle t + 1, Aixi selects as its next action 
the first action of an action sequence maximizing ^-predicted reward up to the given 
horizon, appropriately generalizing eq. O. Aixi uses observations optimally |4|: the 
Bayes-optimal policy based on the mixture ^ is self-optimizing in the sense that 
its average utility value converges asymptotically for all ^ G to the optimal value 
achieved by the Bayes-optimal policy which knows ji in advance. The necessary 
and sufficient condition is that admits self-optimizing policies. The policy is 
also Pareto-optimal in the sense that there is no other policy yielding higher or equal 
value in all environments v ^ M and a strictly higher value in at least one 141. 
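
The step from prediction to action selection can be sketched on a toy problem. This is not AIXI itself, only a finite expectimax planner under loudly stated assumptions: the environment class consists of two hypothetical two-armed bandits with binary rewards, the mixture is represented by posterior weights over these two candidates, and the horizon is small. The names `expectimax` and `ENVS` are illustrative.

```python
def expectimax(posterior, envs, horizon):
    """Return (expected total reward, best first action) under the mixture.

    posterior: current weights over candidate environments (sums to 1).
    envs[i][a]: probability that action a yields reward 1 in environment i.
    horizon: number of remaining action cycles.
    """
    if horizon == 0:
        return 0.0, None
    best_value, best_action = float("-inf"), None
    for action in range(len(envs[0])):
        value = 0.0
        for reward in (0, 1):
            # Weight of each environment times its likelihood of this reward.
            joint = [w * (env[action] if reward == 1 else 1.0 - env[action])
                     for w, env in zip(posterior, envs)]
            p_reward = sum(joint)  # mixture probability of this reward
            if p_reward == 0.0:
                continue
            # Bayesian update, then plan the remaining cycles recursively.
            updated = [j / p_reward for j in joint]
            future, _ = expectimax(updated, envs, horizon - 1)
            value += p_reward * (reward + future)
        if value > best_value:
            best_value, best_action = value, action
    return best_value, best_action


# Assumed toy class: arm 0 is good in environment 0, arm 1 in environment 1.
ENVS = [[0.9, 0.1], [0.1, 0.9]]

if __name__ == "__main__":
    print(expectimax([0.8, 0.2], ENVS, 3))
```

The planner returns only the first action of the best sequence, which mirrors the cycle-by-cycle selection described above. The cost grows exponentially with the horizon, so this is practical only for tiny examples.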

A.7 Computable Selector of Provably Optimal Actions, Given Current System

AIXI above needs unlimited computation time. Its computable variant AIXI(t,l) [4] has
asymptotically optimal runtime but may suffer from a huge constant slowdown. To
take the consumed computation time into account in a general, optimal way, we may
use the recent Gödel machines [23, 26, 24] instead. They represent the first class of
mathematically rigorous, fully self-referential, self-improving, general, optimally
efficient problem solvers. They are also applicable to the problem embodied by objective
(6).

The initial software $S$ of such a Gödel machine contains an initial problem solver,
e.g., some typically sub-optimal method [6]. It also contains an asymptotically optimal
initial proof searcher based on an online variant of Levin's Universal Search [8], which
is used to run and test proof techniques. Proof techniques are programs written in a
universal language implemented on the Gödel machine within $S$. They are in principle
able to compute proofs concerning the system's own future performance, based on an
axiomatic system $A$ encoded in $S$. $A$ describes the formal utility function, in our case
eq. (6), the hardware properties, axioms of arithmetic and probability theory and data
manipulation etc., and $S$ itself, which is possible without introducing circularity [26].

Inspired by Kurt Gödel's celebrated self-referential formulas (1931), the Gödel
machine rewrites any part of its own code (including the proof searcher) through a
self-generated executable program as soon as its Universal Search variant has found a proof
that the rewrite is useful according to objective (6). According to the Global Optimality
Theorem [23, 26, 24], such a self-rewrite is globally optimal (no local maxima
possible!), since the self-referential code first had to prove that it is not useful to
continue the search for alternative self-rewrites.

If there is no provably useful optimal way of rewriting $S$ at all, then humans will
not find one either. But if there is one, then $S$ itself can find and exploit it.
Unlike the previous non-self-referential methods based on hardwired proof searchers [4],
Gödel machines not only boast an optimal order of complexity but can optimally
reduce (through self-changes) any slowdowns hidden by the $O()$-notation, provided the
utility of such speed-ups is provable.



A.8 Consequences of Optimal Action Selection

Now let us apply any optimal RL algorithm to curiosity rewards as defined above.
The expected consequences are: at time $t$ the controller will do its best to select an
action $y(t)$ that starts an action sequence expected to create observations yielding
maximal expected compression progress up to the expected death $T$, taking into account
the limitations of both the compressor and the compressor improvement algorithm. In
particular, ignoring issues of computation time, it will focus in the best possible way
on things that are currently still uncompressible but will soon become compressible
through additional learning. It will get bored by things that already are compressible.
It will also get bored by things that are currently uncompressible but will apparently
remain so, given the experience so far, or where the costs of making them compressible
exceed those of making other things compressible, etc.
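
These qualitative consequences can be reproduced on a toy example. The sketch below rests on several assumptions that are not part of the paper: the compressor is a Laplace-smoothed order-0 bit model, curiosity reward is taken as the number of bits saved on the extended history after retraining, and the controller is a greedy chooser that is allowed to inspect the next block from each data source (a real RL controller would have to predict this progress from experience). All names, including `CHANNELS`, are hypothetical.

```python
import math


def train(history):
    """Fit a Laplace-smoothed order-0 bit model; returns P(bit == 1)."""
    return (sum(history) + 1) / (len(history) + 2)


def code_length(p_one, history):
    """Bits needed to encode `history` with an ideal code for Bernoulli(p_one)."""
    return sum(-math.log2(p_one if bit else 1.0 - p_one) for bit in history)


def compression_progress(history, new_block):
    """Bits saved on the extended history by retraining the compressor on it."""
    extended = history + new_block
    old_model = train(history)
    new_model = train(extended)
    return code_length(old_model, extended) - code_length(new_model, extended)


def pick_most_interesting(channels):
    """Greedy stand-in for the controller: choose the largest progress."""
    return max(channels, key=lambda name: compression_progress(*channels[name]))


# Each entry maps a source name to (history seen so far, next block of bits).
CHANNELS = {
    "known_regularity": ([1] * 1000, [1] * 32),  # already compressed well
    "novel_regularity": ([], [1] * 32),          # learnable and not yet learned
    "noise_like": ([], [0, 1] * 16),             # balanced bits, no order-0 gain
}

if __name__ == "__main__":
    for name, (history, block) in CHANNELS.items():
        print(name, compression_progress(history, block))
    print(pick_most_interesting(CHANNELS))
```

The alternating source is perfectly regular, yet it yields no progress here because an order-0 model cannot represent the pattern. This illustrates that interestingness is relative to the limitations of the given compressor and its improvement algorithm.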